STATS 32 Session 2: Packages and Data Frames

Kenneth Tay

Oct 4, 2018

Gear Up for Social Science Data Extravaganza

Recap of session 1

Vectors

vec <- c(10, 5, 20)
vec <- 1:10 * 2
vec
##  [1]  2  4  6  8 10 12 14 16 18 20
vec[c(1,5)]
## [1]  2 10

Lists

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))
cars$make
## [1] "Honda"
cars[["models"]]
## [1] "Fit"     "CR-V"    "Odyssey"

Agenda for today

What is a data frame?

Example of a dataset

R’s syntax for creating data frames

df <- data.frame(votes_dem = c(486351, 318, 5904), 
                 votes_gop = c(91189, 211, 10239))
df
##   votes_dem votes_gop
## 1    486351     91189
## 2       318       211
## 3      5904     10239
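As a small sketch of what can be done right after creating the data frame, the snippet below rebuilds the same `df` and inspects it with a few base R functions. The `total` column is a hypothetical addition for illustration only; it is not part of the original example.

```r
# Same data frame as in the example above
df <- data.frame(votes_dem = c(486351, 318, 5904),
                 votes_gop = c(91189, 211, 10239))

dim(df)     # number of rows, then number of columns
names(df)   # column names
str(df)     # compact summary: type and first values of each column

# Hypothetical extra column, computed element-wise from the existing ones
df$total <- df$votes_dem + df$votes_gop
df
```

Each named argument to `data.frame()` becomes one column, which is why the two vectors need to have the same number of entries.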

Data frames “under the hood”

is.list(df)
## [1] TRUE
df$votes_dem
## [1] 486351    318   5904
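To make the "under the hood" point concrete, here is a short sketch showing that the list tools from session 1 also work on a data frame. It reuses the `df` from the earlier slide; nothing else is assumed.

```r
df <- data.frame(votes_dem = c(486351, 318, 5904),
                 votes_gop = c(91189, 211, 10239))

class(df)            # "data.frame": how R labels the object
typeof(df)           # "list": how R stores it internally
length(df)           # 2: one list element per column
df[["votes_gop"]]    # list-style extraction, same idea as cars[["models"]]
sapply(df, length)   # every column is a vector of the same length (3)
```

In other words, a data frame can be thought of as a list of equal-length vectors, one per column.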

R’s syntax

3 different types of syntax:

Functions: R’s workhorse

A function is a named block of code which

(Source: codehs.gitbooks.io)

We use functions in R all the time

We’ve already seen a number of functions in R! For example,

is.character("123")
## [1] TRUE

The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.

Others we’ve seen: str, log, typeof, rm, c, list, length, …
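A few more calls in the same spirit, as a quick sketch. The input values are chosen only for illustration.

```r
is.character("123")      # TRUE: the quotes make this a character string
is.character(123)        # FALSE: without quotes it is a number
typeof(123)              # "double"
length(c(10, 5, 20))     # 3: number of elements in the vector
log(1)                   # 0: natural logarithm by default
```

In each case the pattern is the same: function name, then the input inside parentheses, and R returns an output.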

We can see what a function does by typing in ? followed by the function name in the R console.

?is.character

Structure of an R function call

A function call consists of:

mean(): An example

Take the mean of c(1,3,NA).

mean(c(1,3,NA))
## [1] NA
mean(c(1,3,NA), na.rm = TRUE)
## [1] 2
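The sketch below spells out what `na.rm = TRUE` is doing, and shows that the same argument is accepted by some other base R summary functions such as `sum()`.

```r
x <- c(1, 3, NA)

is.na(x)                 # FALSE FALSE TRUE: which entries are missing
mean(x)                  # NA: a missing value makes the result unknown
mean(x, na.rm = TRUE)    # 2: NA is dropped before averaging
mean(x[!is.na(x)])       # 2: same thing, dropping the NA by hand
sum(x, na.rm = TRUE)     # 4: sum() accepts na.rm as well
```

The default is `na.rm = FALSE`, so R reports `NA` unless we explicitly ask for missing values to be removed.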

sample(): Description

sample(): Usage

In the Usage section, the value after the = sign is the default value for that argument.
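As an illustration of defaults, the sketch below relies on the documented usage `sample(x, size, replace = FALSE, prob = NULL)`. The seed value and the variable names `perm` and `with_repl` are arbitrary choices for this example.

```r
args(sample)   # prints the argument list, including the defaults

set.seed(1)    # arbitrary seed so the random draws are repeatable

# Only x is supplied: size falls back to length(x) and replace to FALSE,
# so the result is a random reordering of 1:5
perm <- sample(1:5)
sort(perm)

# Overriding the default: drawing 10 values from 5 needs replace = TRUE
with_repl <- sample(1:5, size = 10, replace = TRUE)
length(with_repl)
```

Arguments with a default can be left out of the call; arguments without one usually have to be supplied.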

sample(): Arguments

sample(): Details

sample(): Value

How does R know which arguments we are referring to?

sample(x = 1:10, size = 10)
##  [1]  4  9  7  6 10  1  8  5  3  2
sample(1:10, 10, TRUE)
##  [1]  9  2  2  2 10  1  2  4  8  3
sample(size = 5, 1:10)
## [1] 5 1 3 4 2
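One way to check that these styles of call really mean the same thing is to fix the random seed and compare results. This is only a sketch; the seed value 32 and the variable names are arbitrary.

```r
set.seed(32); by_name     <- sample(x = 1:10, size = 5)   # all arguments named
set.seed(32); by_position <- sample(1:10, 5)              # matched by position
set.seed(32); mixed       <- sample(size = 5, 1:10)       # named first, rest by position

identical(by_name, by_position)   # TRUE
identical(by_name, mixed)         # TRUE
```

R first matches the arguments that are named, then fills the remaining ones in the order they appear in the function's Usage. That is why `sample(1:10, 10, TRUE)` treats `TRUE` as the third argument, `replace`.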

Functions can be “chained” together

Commands are evaluated “from inside out”

is.character(as.character(123))
## [1] TRUE
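To see the "inside out" evaluation order, the sketch below writes the same computation once step by step and once as a nested call. The names `step1`, `step2`, and `nested` are made up for this example.

```r
# Step by step
step1 <- as.character(123)     # innermost call runs first: "123"
step2 <- is.character(step1)   # then the outer call: TRUE

# Nested version: same result in one line
nested <- is.character(as.character(123))
identical(step2, nested)       # TRUE

# A longer chain: sum() runs first, then sqrt(), then round()
round(sqrt(sum(c(1, 3, 5))), 2)   # sum is 9, square root is 3
```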

Packages

Today’s dataset: Fuel economy

(Source: SuperCars)

fueleconomy: Package information on CRAN

https://cran.r-project.org/web/packages/fueleconomy/index.html
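A minimal sketch of the usual install-then-load workflow for a CRAN package is shown below. It assumes an internet connection for the install step, and that the package provides a dataset called `vehicles`; check the package documentation if that name does not work in your version.

```r
# Install once per computer (needs internet), so it is commented out here:
# install.packages("fueleconomy")

# Check whether the package is available before trying to load it
pkg_available <- requireNamespace("fueleconomy", quietly = TRUE)

if (pkg_available) {
  library(fueleconomy)   # load the package: needed in every new R session
  head(vehicles)         # first few rows of the (assumed) vehicles dataset
} else {
  message("fueleconomy is not installed; run install.packages(\"fueleconomy\") first")
}
```

Installing is like buying a book; `library()` is like taking it off the shelf each time you want to read it.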


Optional material

Why use functions?

Reason #1: Functions make code more understandable.

Example: What is the line of code below trying to do?

x <- c(4, 234, 1, 50, 764)
x <- (x - min(x)) / (max(x) - min(x))
x
## [1] 0.003931848 0.305373526 0.000000000 0.064220183 1.000000000
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
rescale01(c(4, 234, 1, 50, 764))
## [1] 0.003931848 0.305373526 0.000000000 0.064220183 1.000000000

Why use functions?

Reason #2: Functions make code more concise.

list$a <- (list$a - min(list$a)) / (max(list$a) - min(list$a))
list$b <- (list$b - min(list$b)) / (max(list$b) - min(list$b))
list$b <- (list$c - min(list$c)) / (max(list$c) - min(list$c))

vs.

list$a <- rescale01(list$a)
list$b <- rescale01(list$b)
list$c <- rescale01(list$c)

Can you spot the mistake in the first block?

Why use functions?

Reason #3: Functions enable code reuse and code changes.

list$a <- (list$a - min(list$a)) / (max(list$a) - min(list$a))
list$b <- (list$b - min(list$b)) / (max(list$b) - min(list$b))
list$c <- (list$c - min(list$c)) / (max(list$c) - min(list$c))

vs.

rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
list$a <- rescale01(list$a)
list$b <- rescale01(list$b)
list$c <- rescale01(list$c)

What if I want to rescale the entries to be between 0 and 2 instead?
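One possible answer, as a sketch: with a function, the change is made in a single place. The function name `rescale` and its `upper` argument are hypothetical names introduced for this example.

```r
# Rescale x to lie between 0 and `upper` (hypothetical argument, default 1)
rescale <- function(x, upper = 1) {
  upper * (x - min(x)) / (max(x) - min(x))
}

x <- c(4, 234, 1, 50, 764)
rescale(x)              # same result as rescale01(x)
rescale(x, upper = 2)   # now between 0 and 2
```

Without a function, the same change would have to be made by hand on every one of the three lines, with three chances to make a copy-paste mistake.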

List of useful packages

Measures of central tendency
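As a brief sketch, base R provides `mean()` and `median()` for the two most common measures of central tendency. The vector below is reused from the rescaling example.

```r
x <- c(4, 234, 1, 50, 764)

mean(x)     # arithmetic average: 210.6
median(x)   # middle value after sorting: 50
```

The large gap between the two here comes from the single large value 764, which pulls the mean up but leaves the median unchanged.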

Measures of spread
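A short sketch of base R functions commonly used to measure spread. The data values are made up for illustration.

```r
y <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data

range(y)   # smallest and largest value: 2 and 9
var(y)     # sample variance (divides by n - 1)
sd(y)      # sample standard deviation: square root of var(y)
IQR(y)     # interquartile range: 75th minus 25th percentile
```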